class: center, middle, inverse, title-slide

# Applied Machine Learning in R
## .big[👨‍💻 🤖 👩‍🏫] ### Pittsburgh Summer Methodology Series ### Lecture 1-A July 19, 2021 --- class: inverse, center, middle # Workshop Overview <style type="text/css"> .onecol { font-size: 26px; } .twocol { font-size: 24px; } .remark-code { font-size: 24px; border: 1px solid grey; } a { background-color: lightblue; } .remark-inline-code { background-color: white; } </style> --- class: twocol ## Jeffrey Girard .pull-left[ Assistant Professor<br /> University of Kansas **Research Areas** - Affective Science - Clinical Psychology - Computer Science **Machine Learning** - Recognition of Facial Expressions - Prediction of Emotional States - Prediction of Mental Health Status ] .pull-right[ .center[ <img src="data:image/png;base64,#jg_headshot.jpeg" width="300" height="300" /> [www.jmgirard.com](https://www.jmgirard.com)<br /> [jmgirard@ku.edu](mailto:jmgirard@ku.edu)<br /> [@jeffreymgirard](https://twitter.com/jeffreymgirard) ] ] --- class: twocol ## Shirley Wang .pull-left[ Doctoral Candidate<br /> Harvard University **Research Areas** - Clinical Psychology - Computational Psychiatry - Mathematical Modeling **Machine Learning** - Prediction of suicide risk - Prediction of longitudinal illness course - Idiographic prediction ] .pull-right[ .center[ <img src="data:image/png;base64,#sw_headshot.jpeg" width="300" height="300" /> [shirleywang.rbind.io](https://shirleywang.rbind.io/)<br /> [shirleywang@g.harvard.edu](mailto:shirleywang@g.harvard.edu)<br /> [@ShirleyBWang](https://twitter.com/ShirleyBWang) ] ] --- class: twocol ## Goals and Timeline .pull-left[ **Build a foundation** of concepts and skills **Describe every step** from start to finish Emphasize **practical and applied** aspects Provide intuitions rather than lots of theory Dive deeper into a few algorithms Highlight algorithms good for beginners Communicate the pros and cons of choices ] -- .pull-right[ <table class="table" style="font-size: 20px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th 
style="text-align:left;"> Day </th> <th style="text-align:left;"> Topic </th> <th style="text-align:left;"> Lead </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1-A </td> <td style="text-align:left;"> Conceptual introductions </td> <td style="text-align:left;"> JG </td> </tr> <tr> <td style="text-align:left;"> 1-B </td> <td style="text-align:left;"> Logistics and data exploration </td> <td style="text-align:left;"> SW </td> </tr> <tr> <td style="text-align:left;"> 2-A </td> <td style="text-align:left;"> Feature engineering </td> <td style="text-align:left;"> JG </td> </tr> <tr> <td style="text-align:left;"> 2-B </td> <td style="text-align:left;"> Predictive modeling basics </td> <td style="text-align:left;"> JG </td> </tr> <tr> <td style="text-align:left;"> 3-A </td> <td style="text-align:left;"> Regularized regression </td> <td style="text-align:left;"> SW </td> </tr> <tr> <td style="text-align:left;"> 3-B </td> <td style="text-align:left;"> Decision trees and random forests </td> <td style="text-align:left;"> SW </td> </tr> <tr> <td style="text-align:left;"> 4-A </td> <td style="text-align:left;"> Support vector machines </td> <td style="text-align:left;"> JG </td> </tr> <tr> <td style="text-align:left;"> 4-B </td> <td style="text-align:left;"> Practical matters and advice </td> <td style="text-align:left;"> SW </td> </tr> <tr> <td style="text-align:left;"> 5-A </td> <td style="text-align:left;"> Panel Q&A and discussion </td> <td style="text-align:left;"> Both </td> </tr> <tr> <td style="text-align:left;"> 5-B </td> <td style="text-align:left;"> Hackathon and consultation </td> <td style="text-align:left;"> Both </td> </tr> </tbody> </table> ] --- class: twocol ## Format and Materials .pull-left[ Each workshop day will have **two parts** Most parts will have **lecture** and **live coding** Most parts will have hands-on **activities** We will take a **~10-minute break** after the first part ] -- .pull-right[ Course materials are on GitHub and OSF - 
[github.com/ShirleyBWang/pittmethods_ml](https://github.com/ShirleyBWang/pittmethods_ml) - [osf.io/3qhc8](https://osf.io/3qhc8) You can download and re-use the materials according to our CC BY (Attribution) license ] .footnote[A few inspirations for this workshop include [Applied Predictive Modeling](http://appliedpredictivemodeling.com/), [Tidy Modeling with R](https://www.tmwr.org/), and [StatQuest](https://statquest.org/).] --- class: twocol ## Etiquette and Responsibilities .pull-left[ **Behave professionally** at all times Stay on topic and **minimize distractions** **Stay muted** unless talking to minimize noise Ask **questions** in chat or use "Raise Hand" Be **respectful** to everyone in the workshop Be **patient** with yourself and others ] -- .pull-right[ **You always have the right to:** - Be **treated with respect** at all times - Turn your **camera on or off** - **Arrive and depart** whenever needed - **Ask for help** with workshop content - **Share your opinions** respectfully - **Reuse materials** according to the license - Receive **reasonable accommodations** - **Contact the instructors** by email ] --- class: inverse, center, middle # Icebreakers --- class: onecol ## Icebreakers We will randomly assign everyone to one of two breakout rooms Each person will have *up to one minute* to introduce themselves In addition to **sharing your name**, please answer these questions: 1. Where are you joining us from? 2. What field(s) do you work in? 3. What is one of your research interests? 4. What is one of your personal interests? The instructor will go first and call on attendees to go next If you would prefer not to share, please indicate that in chat --- class: inverse, center, middle # Conceptual Introduction --- class: onecol ## What is machine learning? 
The field of machine learning (ML) is a **branch of computer science** ML researchers **develop algorithms** with the capacity to **learn from data** When algorithms learn from (i.e., are **trained on**) data, they create **models**<sup>1</sup> <p style="padding-top:20px;">This workshop is all about applying ML algorithms to create **predictive models**</p> The goal will be to **predict unknown values** of important variables **in new data** .footnote[ [1] ML models are commonly used for prediction, data mining, and data generation. ] --- class: twocol .pull-left[ ## Labels / Outcomes Labels are variables we **want to predict**<br />the values of (because they are unknown) Labels tend to be expensive or difficult to measure in new data (though they are known in some existing data that we can learn from) AKA outcome, dependent, or `\(y\)` variables <img src="data:image/png;base64,#label_icons.png" width="100%" /> ] -- .pull-right[ ## Features / Predictors Features are variables we **use to predict**<br />the unknown values of the label variables Features tend to be cheaper and easier to measure in new data than labels (and are also known in some existing data) AKA predictor, independent, or `\(x\)` variables <img src="data:image/png;base64,#feature_icons.png" width="100%" /> ] --- class: twocol ## Modes of Predictive Modeling .pull-left[ When labels have continuous values, predicting them is called **regression** <img src="data:image/png;base64,#regression_diagram.png" width="100%" /> - *How much will a customer spend?* - *What GPA will a student achieve?* - *How long will a patient be hospitalized?* ] -- .pull-right[ When labels have categorical values, predicting them is called **classification** <img src="data:image/png;base64,#classification_diagram.png" width="100%" /> - *Is an email spam or non-spam?* - *Which candidate will a user vote for?* - *Is a patient's glucose low, normal, or high?* ] .footnote[*Unsupervised learning* (AKA data mining) has no explicit labels and just looks for patterns within the features.] 
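---
class: onecol
## The Two Modes in R Code

A minimal sketch of both modes using base R and the built-in `mtcars` data (the variable choices here are illustrative only; later lectures use dedicated ML packages):

```r
# Regression: predict a continuous label (mpg) from two features
reg_fit <- lm(mpg ~ wt + hp, data = mtcars)
head(predict(reg_fit), 3)  # predicted mpg for the first three cars

# Classification: predict a binary label (am: 0 = automatic, 1 = manual)
clf_fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
head(predict(clf_fit, type = "response"), 3)  # predicted probabilities of am = 1
```

Note that the label type (continuous vs. categorical) determines the mode, not the features: both models above use the same features.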
--- class: onecol ## Modes of Predictive Modeling .pull-left[ .center[**Regression**] <img src="data:image/png;base64,#Day_1A_Slides_files/figure-html/regression_example-1.png" width="100%" /> ] -- .pull-right[ .center[**Classification**] <img src="data:image/png;base64,#Day_1A_Slides_files/figure-html/classification_example-1.png" width="100%" /> ] --- ## Comprehension Check \#1 <span style="font-size:30px;">Ann has developed an ML system that looks at a patient's physiological signals and tries to determine whether they are having a micro-seizure.</span> .pull-left[ ### Question 1 **The features are <span style="text-decoration: underline; white-space: pre;"> </span> and the labels are <span style="text-decoration: underline; white-space: pre;"> </span>.** a) physiological signals; physiological signals b) physiological signals; micro-seizure (yes/no) c) micro-seizure (yes/no); physiological signals d) micro-seizure (yes/no); micro-seizure (yes/no) ] .pull-right[ ### Question 2 **Which "mode" of predictive modeling is this?** a) Regression b) Classification c) Unsupervised learning d) All of the above ] --- class: inverse, center, middle # Modeling Workflow --- ## Typical ML Workflow <br /><br /> .center[ <img src="data:image/png;base64,#workflow.png" width="100%" /> ] --- class: onecol ## Exploratory Analysis .left-column[ <br /> <img src="data:image/png;base64,#explore.jpg" width="100%" /> ] .right-column[ **Verify the quality of your variables** - Examine the distributions of feature and label variables - Look for errors, outliers, missing data, etc. 
**Gain inspiration for your model** - Identify relevant features for a label - Detect highly correlated features - Determine the "shape" of relationships ] --- class: onecol ## Feature Engineering .left-column[ <br /> <img src="data:image/png;base64,#engineer.jpg" width="100%" /> ] .right-column[ **Prepare the features for analysis** - *Extract* features - *Transform* features - *Re-encode* features - *Combine* features - *Reduce* feature dimensionality - *Impute* missing feature values - *Select* and drop features ] --- class: onecol ## Model Development .left-column[ <br /> <img src="data:image/png;base64,#develop.jpg" width="100%" /> ] .right-column[ **Choose algorithms, software, and architecture** - Elastic Net and/or Random Forest - `caret` or `tidymodels`, `elasticnet` or `glmnet` - Regression or classification **Train the model by estimating parameters** - Learn the nature of the feature-label relationships - For instance, estimate the intercept and slopes ] --- class: onecol ## Model Tuning .left-column[ <br /> <img src="data:image/png;base64,#tune.jpg" width="100%" /> ] .right-column[ **Determine how complex the model can become** - How many features to include in the model - How complex the shape of relationships can be - How many features can interact together - How much to penalize adding more complexity **Make other decisions in a data-driven manner** - Which of three algorithms should be preferred - Which optimization method should be used ] --- class: onecol ## Model Evaluation .left-column[ <img src="data:image/png;base64,#target.jpg" width="100%" /> ] .right-column[ **Decide how to quantify predictive performance** - In regression, performance is based on the errors/residuals - In classification, performance is based on the confusion matrix **Determine how successful your predictive model was** - Compare predictions (i.e., predicted labels) to trusted labels - Compare the performance of one model to another model ] --- ## Comprehension Check \#2 
<span style="font-size:30px;">Yuki trained an algorithm to predict the number of "likes" a tweet will receive based on measures of the tweet's formatting and content.</span> .pull-left[ ### Question 1 **Calculating the length of each tweet is <span style="text-decoration: underline; white-space: pre;"> </span>?** a) Feature Engineering b) Model Development c) Model Tuning d) Model Evaluation ] .pull-right[ ### Question 2 **When should problems with the data be found?** a) Model Evaluation b) Model Tuning c) Model Development d) Exploratory Analysis ] --- class: inverse, center, middle # Signal and Noise --- class: onecol ## A Delicate Balance Any data we collect will contain a mixture of **signal** and **noise** - The "signal" represents informative patterns that generalize to new data - The "noise" represents distracting patterns specific to the original data We want to capture as much signal and as little noise as possible -- <p style="padding-top:30px;">More complex models will allow us to capture <b>more signal</b> but also <b>more noise</b></p> **Overfitting**: If our model is too complex, we will capture unwanted noise **Underfitting**: If our model is too simple, we will miss important signal --- ## Model Complexity <img src="data:image/png;base64,#Day_1A_Slides_files/figure-html/complexity-1.png" width="100%" /> --- class: twocol ## A Super Metaphor .pull-left[ What makes machine learning so amazing is its **ability to learn complex patterns** However, with this great power and flexibility comes the looming **danger of overfitting** Thus, much of ML research is about finding ways to **detect** and **counteract** overfitting For detection, we need two sets of data: - **Training data**: used to learn relationships - **Testing data**: used to evaluate performance ] .pull-right[ .center[ <img src="data:image/png;base64,#kryptonite.png" width="67%" /> ] ] --- layout: true ## An Example of Overfitting --- <img src="data:image/png;base64,#Day_1A_Slides_files/figure-html/overex1-1.png" width="100%" /> --- .pull-left[ 
.center[**Model A (Low Complexity)**] <img src="data:image/png;base64,#Day_1A_Slides_files/figure-html/overex2-1.png" width="100%" /> ] .pull-right[ .center[**Model B (High Complexity)**] <img src="data:image/png;base64,#Day_1A_Slides_files/figure-html/overex3-1.png" width="100%" /> ] --- count: false .pull-left[ .center[**Model A (Low Complexity)**] <img src="data:image/png;base64,#Day_1A_Slides_files/figure-html/overex4-1.png" width="100%" /> Total error on training data = 31.2 (higher bias) ] .pull-right[ .center[**Model B (High Complexity)**] <img src="data:image/png;base64,#Day_1A_Slides_files/figure-html/overex5-1.png" width="100%" /> Total error on training data = 0.0 (lower bias) ] --- <img src="data:image/png;base64,#Day_1A_Slides_files/figure-html/overex6-1.png" width="100%" /> --- .pull-left[ .center[**Model A (Low Complexity)**] <img src="data:image/png;base64,#Day_1A_Slides_files/figure-html/overex7-1.png" width="100%" /> ] .pull-right[ .center[**Model B (High Complexity)**] <img src="data:image/png;base64,#Day_1A_Slides_files/figure-html/overex8-1.png" width="100%" /> ] --- count: false .pull-left[ .center[**Model A (Low Complexity)**] <img src="data:image/png;base64,#Day_1A_Slides_files/figure-html/overex9-1.png" width="100%" /> Total error on testing set = 17.4 (lower variance) ] .pull-right[ .center[**Model B (High Complexity)**] <img src="data:image/png;base64,#Day_1A_Slides_files/figure-html/overex10-1.png" width="100%" /> Total error on testing set = 41.6 (higher variance) ] --- layout: false class: onecol ## Conclusions from Example In ML, **bias** is a lack of predictive accuracy in the original data (the "training set") In ML, **variance** is a lack of predictive accuracy in new data (the "testing set") -- <p style="padding-top:25px;">An ideal predictive model would have both low bias and low variance</p> However, there is often an inherent <b>trade-off between bias and variance</b><sup>1</sup> .footnote[[1] To increase our testing set performance, we often need to worsen our performance in the training set.] 
-- <p style="padding-top:25px;">We want to find the model that is <b>as simple as possible</b> but <b>no simpler</b></p> --- ## A Graphical Explanation of Overfitting <img src="data:image/png;base64,#overfitting.png" width="100%" /> --- ## A Meme-based Explanation of Overfitting <img src="data:image/png;base64,#overfitting_lay.png" width="100%" /> --- ## Comprehension Check \#3 <span style="font-size:30px;">Sam used all emails in his inbox to create an ML model to classify emails as "work-related" or "personal." Its accuracy on these emails was 98%.</span> .pull-left[ ### Question 1 **Is Sam done with this ML project?** a) Yes, he should sell this model right now! b) No, he needs to create a training set c) No, he needs to test the model on new data d) No, his model needs to capture more noise ] .pull-right[ ### Question 2 **Which problems has Sam already addressed?** a) Overfitting b) Underfitting c) Variance d) All of the above ] --- class: inverse, center, middle # Countering Overfitting --- class: onecol ## Cross-Validation There are some clever algorithmic tricks to prevent overfitting - For example, we can penalize the model for adding complexity The main approach, however, is to use **cross-validation**: -- - Multiple **fully independent** sets of data are created (by subsetting or resampling) - Some sets are used for training (and tuning) and other sets are used for testing - **Model evaluation is always done on data that were not used to train the model** - This way, if performance looks good, we can worry less about variance/overfitting<sup>1</sup> .footnote[[1] Although we always need to worry somewhat about whether the original data was itself representative.] 
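---
class: onecol
## Cross-Validation in R Code

A sketch of holdout and k-fold splitting, assuming the `rsample` package from the `tidymodels` ecosystem (`mtcars` stands in for your own data):

```r
library(rsample)  # resampling infrastructure from tidymodels

set.seed(2021)  # make the random splits reproducible

# Holdout cross-validation: one training set (80%) and one testing set (20%)
split <- initial_split(mtcars, prop = 0.80)
train_data <- training(split)
test_data  <- testing(split)

# k-fold cross-validation: 10 folds, repeated 3 times (30 resamples total)
folds <- vfold_cv(mtcars, v = 10, repeats = 3)
```

Because every row lands in exactly one of `train_data` or `test_data`, evaluation on `test_data` uses only data the model never saw during training.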
--- # Holdout Cross-Validation <img src="data:image/png;base64,#holdout.png" width="100%" /> --- # Holdout Cross-Validation <img src="data:image/png;base64,#holdout2.png" width="100%" /> --- # k-fold Cross-Validation <img src="data:image/png;base64,#kfold1.png" width="100%" /> --- count: false # k-fold Cross-Validation .center[ <img src="data:image/png;base64,#kfold2.png" width="70%" /> ] --- class: onecol ## Advanced Cross-Validation Cross-validation can also be **nested** to let the model tune on unseen data: - An outer loop (applied to the original data) is used for *model evaluation* - An inner loop (applied to the training set) is used for *model tuning* Cross-validation can also be **stratified** to keep the sets relatively similar Cross-validation can also be **repeated** to avoid problems with any single split .footnote[A great default procedure is nested, stratified, and 3x repeated 10-fold cross-validation.] --- ## Comprehension Check \#4 <span style="font-size:30px;">Bogdan collects data from 1000 patients. He assigns patients 1 to 800<br />to be in his training set and patients 700 to 1000 to be in his testing set.</span> .pull-left[ ### Question 1 **What major mistake did Bogdan make?** a) He used a testing set instead of a holdout set b) Some patients are in both training and testing c) The two subsets of data have different sizes d) He did not use k-fold cross-validation ] .pull-right[ ### Question 2 **Which step should not be done in the training set?** a) Exploratory Analysis b) Feature Engineering c) Model Development d) Model Evaluation ] --- class: inverse, center, middle # Small Group Discussion --- class: onecol # Small Group Discussion We will randomly assign you to a small breakout room We will jump between rooms to join discussions and answer questions **Introduce yourselves again and discuss the following topics** 1. What types of labels and features would you like to work with? 2. 
What problems might predictive modeling help your field solve? 3. Do you have any questions or comments about the material so far? --- class: inverse, center, middle # Time for a Break!
10:00